Antibiotic Resistance

Project Group 10

Edene Levine · Molly Hong-Minh · Pablo Vivero · Vinit Nilesh Vasa

Data & Aims

Dataset Overview:
A collection of bacterial isolates with associated patient demographics, clinical factors and antibiotic susceptibility results.

Aims:
- Use linear models to determine which clinical or individual variables significantly influence the number of resistances in an infection
- Explore antibiotic resistance profiles across species using visualisation techniques such as heatmaps
- Perform PCA to identify clustering based on resistance profiles


Data: https://www.kaggle.com/datasets/adilimadeddinehosni/multi-resistance-antibiotic-susceptibility/data

Data Processing & Description

Data Loading, Cleaning & Overview

Workflow

  • library(Rkaggle): Load data directly from Kaggle
  • Clean Data: Remove NAs, duplicates & unnecessary columns
  • Tidy Data: Restructure data into tidy format
  • Renaming: Identify naming inconsistencies & rename accordingly
  • Data Description: table1() and ggplot()
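
The cleaning steps above can be sketched on a toy data frame. This is only an illustration: the values, the `Notes` column and the `E.coli` spelling variant are invented here, and the project's own cleaning script may do things differently.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy raw data (hypothetical values, not taken from the Kaggle file)
raw <- tibble(
  ID      = c(1, 2, 2, 3),
  Species = c("E. coli", "E.coli", "E.coli", NA),
  Notes   = c("a", "b", "b", "c"),
  IPM     = c("S", "R", "R", "S")
)

cleaned <- raw |>
  select(-Notes) |>   # drop a column that is not needed downstream
  distinct() |>       # remove exact duplicate rows
  drop_na() |>        # remove rows with any missing value
  # harmonise an inconsistent species spelling
  mutate(Species = if_else(Species == "E.coli", "E. coli", Species))

cleaned
```

Dropping the unneeded column before `distinct()` matters here: rows that differ only in a discarded column then count as duplicates.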

Table 1

library(here)
library(table1)

# Load Data
mdr_data <- read.csv(here("data/cleaned_bacteria_data.csv"))

# Create table
t1 <- table1(
  ~ Age + Gender + Diabetes + Hypertension + Hospital_before +
    Infection_Freq +
    AMX.AMP + AMC + CZ + FOX + CTX.CRO + IPM + GEN + AN +
    Acide.nalidixique + ofx + CIP + C + Co.trimoxazole +
    Furanes + colistine
  | Species,
  data = mdr_data
)

t1

Data Augmentation

New data frame in long format, and creation and modification of variables for the downstream analyses.

Workflow

  • tibble(): Creates a lookup table linking antibiotic codes with their full names and pharmacological classes
  • pivot_longer() + left_join(): Reshapes the dataset and merges antibiotic information
  • MDR variable: Converts R=1, I=0.5, S=0 across antibiotics and sums the values to obtain a resistance score
  • Age-group variable: Bins ages into 10-year categorical intervals

Code

# MDR variable
# Antibiotic columns run from column 8 to the last column
antibiotics_cols <- colnames(mdr_data)[8:ncol(mdr_data)]

# Recode every antibiotic column with count_resistance(), a case_when-based
# helper defined in 99_proj_func.R, then sum the recoded values into MDR
mdr_wide <- mdr_data |>
  mutate(across(all_of(antibiotics_cols), count_resistance)) |>
  mutate(MDR = rowSums(across(all_of(antibiotics_cols)))) |>
  # Drop rows with NA in any column used in the model
  drop_na(Age, Gender, Species, Diabetes, Hypertension,
          Hospital_before, Infection_Freq, MDR)
     
     ...

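# Age-group variable
breaks <- seq(0, 90, 10)
# right = FALSE makes the bins left-closed, so they are labelled [a,b);
# include.lowest = TRUE additionally closes the last bin at 90
labels <- paste0("[", head(breaks, -1), ",", tail(breaks, -1), ")")

mdr_wide <- mdr_wide |>
  mutate(age_group = cut(
    x = Age,
    breaks = breaks,
    right = FALSE,
    include.lowest = TRUE,
    labels = labels
  ))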
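
The helper `count_resistance()` lives in 99_proj_func.R and is not shown on the slide. A minimal sketch of what such a helper might look like, assuming results are stored as the strings "R", "I" and "S"; the name `count_resistance_sketch` and the toy data are hypothetical.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for count_resistance() from 99_proj_func.R
count_resistance_sketch <- function(x) {
  case_when(
    x == "R" ~ 1,      # resistant
    x == "I" ~ 0.5,    # intermediate
    x == "S" ~ 0,      # sensitive
    TRUE     ~ NA_real_  # anything else, including NA, stays missing
  )
}

toy <- tibble(
  IPM = c("R", "S", "I"),
  GEN = c("R", "R", NA)
)

scored <- toy |>
  mutate(across(all_of(c("IPM", "GEN")), count_resistance_sketch)) |>
  mutate(MDR = rowSums(across(all_of(c("IPM", "GEN")))))

scored
```

Because `rowSums()` is called without `na.rm = TRUE`, an isolate with any untested antibiotic gets `MDR = NA` and is removed by the later `drop_na()` call rather than being scored as less resistant.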

Linear Model MDR

Analysis of how age, gender, bacterial species, diabetes, hypertension, prior hospitalisation and infection frequency influence the number of antibiotic resistances in a contracted infection using a linear model.

Workflow

  • lm(): Fits the linear model predicting MDR
  • broom::tidy(): Extracts coefficients, confidence intervals, and p-values
  • mutate(): Cleans variable names and adds significance flag
  • stringr::str_replace(): Standardises variable labels

Code

#Linear model
linear_model <- MDR_df |> 
  lm(formula = MDR ~ Age + Gender + Species + Diabetes + Hypertension + Hospital_before + Infection_Freq,
     data = _)
     
     ...

# Tidy format (tidy() is from broom, str_replace() from stringr)
tidy_lm <- tidy(linear_model,
                conf.int = TRUE,
                conf.level = 0.95) |>
  mutate(
    term = str_replace(term, "Species", ""),
    term = str_replace(term, "Yes", ""),
    term = str_replace(term, "GenderM", "Gender_M"),
    term = str_replace(term, "Freq1", "Freq_1"),
    term = str_replace(term, "Freq2", "Freq_2"),
    term = str_replace(term, "Freq3", "Freq_3"),
    sig  = factor(p.value < 0.05)
  )

Linear Model MDR

Forest Plot
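
A forest plot of this kind can be drawn directly from a tidied coefficient table. The sketch below uses the built-in `mtcars` data as a stand-in for the project data, so the model, the object names and the styling are all assumptions rather than the figure's actual code.

```r
library(broom)
library(dplyr)
library(ggplot2)

# Stand-in model on a built-in dataset (not the project's MDR model)
toy_model <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

toy_tidy <- tidy(toy_model, conf.int = TRUE, conf.level = 0.95) |>
  filter(term != "(Intercept)") |>   # the intercept is rarely of interest here
  mutate(sig = p.value < 0.05)

# One point per term with its 95% confidence interval;
# the dashed line marks an effect of zero
forest_plot <- ggplot(toy_tidy, aes(x = estimate, y = term, colour = sig)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Estimate (95% CI)", y = NULL, colour = "p < 0.05") +
  theme_minimal()

forest_plot
```

Intervals that cross the dashed line correspond to terms whose effect is not distinguishable from zero at the 5% level.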

Volcano Plot

Only bacterial species has a significant effect on MDR

Linear Model Infection frequency

  • Tested whether any of the factors is associated with infection frequency

  • Clinical factors show no association with infection frequency

  • Infection frequency varied significantly by the bacterial species

  • The number of drug resistances also shows no association with the infection frequency

  • The multivariable model shows that the bacterial species was the only meaningful contributor.

Model     Predictors          p-value
Model 1   Patient details     0.5394
Model 2   Bacterial species   0.03842
Model 3   MDR                 0.6002
Model 4   Multivariate        0.08977
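
The table reports a single p-value per model. The slides do not say how these were obtained; if they are overall F-test p-values (an assumption), they could be extracted from a fitted `lm` as sketched below. The helper name `overall_p` is hypothetical and `mtcars` again stands in for the project data.

```r
# Overall F-test p-value of a fitted linear model (base R only)
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic   # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Stand-in model on a built-in dataset
toy_fit  <- lm(mpg ~ wt + hp, data = mtcars)
toy_null <- lm(mpg ~ 1, data = mtcars)

overall_p(toy_fit)

# The same p-value comes from comparing against the intercept-only model
anova(toy_null, toy_fit)
```

An overall p-value tests all predictors jointly, so a model can be non-significant as a whole even when one of its terms stands out.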

Linear Model Infection frequency

Bacterial Resistance Heatmap

Resistance patterns across bacterial species, plotted from the long-format dataset: E. coli and K. pneumoniae show the highest resistance (red), most other species remain largely sensitive (blue), and several antibiotics stay effective across multiple species.
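
A heatmap of this kind can be sketched with `geom_tile()`. Everything below is illustrative: the toy values are invented, and averaging the resistance score per species and antibiotic is an assumption about how the tiles were coloured.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Toy long-format data: two isolates per species and antibiotic (invented)
toy_long <- tibble(
  Species = rep(c("E. coli", "K. pneumoniae", "P. mirabilis"), each = 4),
  code    = rep(c("IPM", "GEN"), times = 6),
  score   = c(1, 1, 0, 1,   1, 0.5, 1, 0,   0, 0, 0, 0.5)
)

# Mean resistance score per species-antibiotic pair
heat_df <- toy_long |>
  group_by(Species, code) |>
  summarise(mean_score = mean(score), .groups = "drop")

# Blue = sensitive, red = resistant, matching the colours described above
heatmap_plot <- ggplot(heat_df, aes(x = code, y = Species, fill = mean_score)) +
  geom_tile(colour = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0.5, limits = c(0, 1)) +
  labs(x = "Antibiotic", y = NULL, fill = "Mean resistance score")

heatmap_plot
```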

PCA

Identify shared patterns of antibiotic resistance across bacterial species.

Workflow

  • select(): Extracts only the antibiotic-resistance columns used for PCA
  • prcomp(): Performs the PCA with variable scaling
  • augment(): Attaches PCA scores (PC1, PC2…) back to the original dataset
  • count(): Determines the number of isolates per species
  • slice_sample(): Downsamples species to equal sizes for balanced PCA

Code

#Variable selection and PCA
antibiotics_cols <- colnames(MDR_df)[c(8:(ncol(MDR_df)-1))]

pca_fit <- MDR_df |>
  select(all_of(antibiotics_cols)) |> 
  prcomp(scale. = TRUE)   # scale. is the documented argument name
     
     ...

#Augment the original dataset   
pca_all_plot <- pca_fit |> 
  augment(MDR_df)
  
     ...
     
#Downsampling
min_n <- MDR_df |>
  count(Species) |> 
  summarise(min(n)) |> 
  pull()

balanced_df <- MDR_df |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()
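
One practical point when rerunning the PCA on downsampled data: `prcomp()` with `scale. = TRUE` stops with an error if any column is constant, which becomes more likely once rows are removed. A minimal sketch of a guard against this, using an invented toy data frame with a deliberately constant `COL` column; fixing the seed is an addition here so that the random downsampling is reproducible.

```r
library(dplyr)

# Toy numeric resistance scores (invented values)
toy <- data.frame(
  Species = rep(c("A", "B"), times = c(6, 3)),
  IPM = c(0, 1, 0, 1, 0.5, 1, 0, 0, 1),
  GEN = c(1, 1, 0, 0, 1, 0.5, 0, 1, 0),
  CIP = c(0, 0, 1, 1, 1, 0, 0.5, 1, 0),
  COL = 0   # constant column, cannot be scaled to unit variance
)

set.seed(42)   # slice_sample() draws rows at random

min_n <- toy |> count(Species) |> pull(n) |> min()

balanced <- toy |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()

# Keep only columns with more than one distinct value, then run the PCA
pca_balanced <- balanced |>
  select(-Species) |>
  select(where(~ n_distinct(.x) > 1)) |>
  prcomp(scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca_balanced$sdev^2 / sum(pca_balanced$sdev^2)
var_explained
```

Whether to drop constant columns or keep them unscaled is a design choice; dropping is used here only because a column with no variation carries no information for the PCA.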

PCA

Before downsampling

After downsampling

After balancing species counts, the PCA shows largely shared resistance patterns across taxa, with only a few species, such as P. aeruginosa and S. marcescens, forming distinct clusters separate from the main E. coli-dominated continuum.